ABSTRACT

In many applications involving quantization the probability distribution
of the input signal is unknown. However, most of the algorithms for
optimal scalar or vector quantization require an explicit distribution
function or probability density. This paper shows that under certain
conditions reasonable quantizer designs can be expected when standard
algorithms are applied to estimates of distribution functions.

I.  INTRODUCTION 

Much work has been expended in studying algorithms for optimal quantization
of a known probability distribution [1-5]. In practice, however,
the statistical description of the source is rarely known precisely. In
some recent papers [2,3] a group of researchers demonstrated both theoretically
and experimentally that a training sequence of independent or ergodic
samples can be used to design near optimal vector quantizers, by using the
sample empirical distribution in place of the unknown input distribution in
a generalized version of Lloyd's Method I [1]. They showed that under some
conditions the quantizer designed for a "long" training sequence approximates
closely the output levels and performance of the optimal quantizer
for the true (unknown) distribution. The same kind of reasoning should
hold for any design algorithm. If the input distribution F is not known,
then we can form an estimate $\hat{F}_n$ based on n observations of the input
signal. As n becomes large, we expect a reasonable estimate to converge to
the true distribution F. Intuitively, then, an optimal quantizer designed
for $\hat{F}_n$, and the resulting distortion, should closely approximate those of an
optimal quantizer for F. In this paper we will establish properties of an
estimator $\hat{F}_n$ so that this kind of reasoning will be valid.

II.  DEVELOPMENT 

An N-level k-dimensional vector quantizer is a mapping $Q: \mathbb{R}^k \to \mathbb{R}^k$ which
assigns to the input vector x an output vector Q(x) chosen from a finite
set of N vectors $\{y_i : y_i \in \mathbb{R}^k,\ i = 1,2,\ldots,N\}$. The distortion incurred in
quantizing a k-dimensional random variable X having a probability distribution
function F is expressed by

$$D(Q,F) = \int C_0(\|x - Q(x)\|)\,dF(x) \qquad (1)$$

where $\|\cdot\|$ is the usual Euclidean norm on $\mathbb{R}^k$ and where all integrals,
unless noted otherwise, are over $\mathbb{R}^k$. We will take the cost function $C_0(t)$
to be nonnegative, nondecreasing on $[0,\infty)$ and lower semicontinuous. It
has been shown previously that optimal quantizers minimizing (1) exist for
all probability distributions F and all k and N [6,7]. This type of distortion
function subsumes most of the popular error criteria used in scalar and
vector quantization. When optimal quantizers are being considered, there is
no loss of generality in assuming the "nearest neighbor" assignment rule:
Q(x) is that member of the output set $\{y_i\}$ which is nearest to x in Euclidean norm, with
ties being broken by an arbitrary preassigned method. This rule will be
adopted throughout the rest of the paper. Thus a quantizer is completely
represented by its set of output vectors.
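The nearest neighbor rule and the distortion (1) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names, the two-point codebook and the three sample vectors are hypothetical, the tie-breaking rule (lowest index wins) is one possible preassigned method, and the distortion is evaluated for a finite sample, that is, for an empirical distribution rather than a general F.

```python
import math

def nearest_neighbor_quantize(x, codebook):
    """Return the output vector nearest to x in Euclidean norm.

    min() keeps the first minimizer, so ties go to the lowest index
    (an assumed choice of 'arbitrary preassigned method').
    """
    return min(codebook, key=lambda y: math.dist(x, y))

def distortion(samples, codebook, cost=lambda t: t ** 2):
    """Average of C_0(||x - Q(x)||) over a finite sample.

    This is D(Q, F) in (1) when F puts mass 1/n on each sample point.
    The default cost t**2 is an assumed example of C_0.
    """
    total = 0.0
    for x in samples:
        q = nearest_neighbor_quantize(x, codebook)
        total += cost(math.dist(x, q))
    return total / len(samples)

# Hypothetical 2-level, 2-dimensional quantizer and a tiny sample.
codebook = [(0.0, 0.0), (1.0, 1.0)]
samples = [(0.1, 0.0), (0.9, 1.0), (0.5, 0.5)]
print(nearest_neighbor_quantize((0.9, 1.0), codebook))
print(distortion(samples, codebook))
```

Because the quantizer is determined by its output set, the codebook list is the only state the sketch needs.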

We will say that a sequence of N-level quantizers $\{Q_n\}$ converges
weakly to the quantizer Q if $Q_n(x) \to Q(x)$ at all continuity points x of Q.

Let $F_n$ and F be k-variate probability distribution functions. The
sequence $\{F_n\}$ is said to converge weakly to F (written $F_n \xrightarrow{w} F$) if $F_n(x) \to F(x)$
at every continuity point x of F. We say that $\{F_n\}$ converges setwise
to F (denoted $F_n \xrightarrow{s} F$) if

$$\lim_{n\to\infty} \int_B dF_n(x) = \int_B dF(x)$$

for every Borel subset B of $\mathbb{R}^k$.

The following theorem from [8] and [9] is the main tool used in the
investigation. Here k and N are fixed positive integers.

Theorem 1: Assume that $C_0(t)$ is nonnegative and nondecreasing on $[0,\infty)$.
Suppose that $C_0(\|x-y\|)$ is uniformly integrable with respect to a sequence
of distribution functions $\{F_n\}$ for every y. Let $Q_n$ be an optimal N-level
quantizer for $F_n$. If either (a) $C_0(t)$ is continuous and $F_n \xrightarrow{w} F$; or
(b) $C_0(t)$ is lower semicontinuous and $F_n \xrightarrow{s} F$, then every convergent subsequence
of $\{Q_n\}$ converges weakly to an optimal quantizer for F. Such a subsequence
exists unless $C_0(\|x\|)$ is constant with F-probability 1. Moreover,
$D(Q_n,F_n)$ converges to the optimal mean distortion for quantizing F
with N levels.

In Theorem 1, convergence of the sequence of optimal quantizers $\{Q_n\}$
cannot be asserted. However, $\{Q_n\}$ can be completely partitioned into
convergent subsequences whose limits are optimal quantizers for F.

In Theorem 1, $F_n$ can be any sequence of distribution functions. We
are interested in having $F_n$ be an estimate $\hat{F}_n$ constructed from n observations
of F. The principal consideration in applying the theorem is in
showing uniform integrability. For sequences of estimates this becomes

$$\lim_{n\to\infty} \int C_0(\|x-y\|)\,d\hat{F}_n(x) = \int C_0(\|x-y\|)\,dF(x) \qquad (2)$$

for each $y \in \mathbb{R}^k$; for almost all training sequences. Before proceeding
further, we make a useful simplification of (2). This equation requires us
to find a set of sample sequences $(X_1,X_2,\ldots)$ on which the integrals
converge for every y in $\mathbb{R}^k$. What we will be using primarily is the Strong
Law of Large Numbers, which yields the conclusion:

$$\lim_{n\to\infty} \int C_0(\|x-y\|)\,d\hat{F}_n(x) = \int C_0(\|x-y\|)\,dF(x) \qquad (3)$$

for almost all training sequences; for each $y \in \mathbb{R}^k$. This last equation says
that for a given y, there is a set A(y) of sample sequences $(X_1,X_2,\ldots)$
having probability one for which the integrals converge. Obviously, (2)
implies (3). Conversely, it is shown in the Appendix that (3) implies (2).
Thus (2) and (3) are equivalent. Hence if we can show (3), we may use
Theorem 1 to get the results we want. In the following examples, we will
show how this can be done in a number of situations.

Empirical Distributions. Let F be the unknown k-variate distribution
which we wish to (vector) quantize. If we take n independent samples
$X_1,X_2,\ldots,X_n$, then the empirical distribution function is $\hat{F}_n(x) = n^{-1}\,\#\{i : X_i \le x\}$,
where the inequality is taken component by component. This is a simple
nonparametric estimator which by the Strong Law of Large Numbers converges
setwise to F for almost all sample sequences as $n \to \infty$. Take $C_0$ to be
nonnegative, nondecreasing, nonconstant with respect to F, and lower
semicontinuous. The demonstration of uniform integrability (3) is a simple
application of the Strong Law of Large Numbers. We have for each y

$$\lim_{n\to\infty} \int C_0(\|x-y\|)\,d\hat{F}_n = \int C_0(\|x-y\|)\,dF \quad \text{wp1},$$

provided that the last integral is finite. According to Theorem 1, then, we
may quantize the $\hat{F}_n$'s, using any available method that yields an optimal
solution and be assured that the resulting quantizers $Q_n$ converge weakly to
an optimal quantizer for F. This generalizes the analysis of [2,3] to other
algorithms besides the extension of Lloyd's Method I. Instead of assuming
the training set to consist of independent samples, we may make an ergodic
(or block ergodic) assumption. This seems to be a useful assumption
regarding information sources. In the development, the Strong Law of Large
Numbers can be replaced by an ergodic theorem [10] and all of the
conclusions reached above will remain valid.
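As a rough numerical illustration of designing a quantizer from the empirical distribution, the following Python sketch runs a Lloyd-type iteration for squared error on a scalar training sequence. Everything specific in it is an assumption made for the example: the uniform source on [0,1], the sample size, the initial levels and the function names. A Lloyd-type iteration is only guaranteed to reach a stationary point, so the sketch does not by itself produce the optimal quantizers that Theorem 1 refers to; for the uniform density with N = 2 the stationary point happens to coincide with the known optimum (levels 0.25 and 0.75, distortion 1/48).

```python
import random

def lloyd_1d(samples, levels, iters=50):
    """Lloyd-type iteration (squared error) on the empirical distribution.

    Alternates nearest-neighbor assignment of the training samples with
    replacement of each level by the mean of its cell.
    """
    levels = sorted(levels)
    for _ in range(iters):
        cells = [[] for _ in levels]
        for x in samples:
            i = min(range(len(levels)), key=lambda j: abs(x - levels[j]))
            cells[i].append(x)
        # An empty cell keeps its previous level.
        levels = [sum(c) / len(c) if c else y for c, y in zip(cells, levels)]
    return levels

def mse(samples, levels):
    """Empirical mean squared error of the nearest-neighbor quantizer."""
    return sum(min((x - y) ** 2 for y in levels) for x in samples) / len(samples)

rng = random.Random(0)                      # fixed seed for repeatability
train = [rng.random() for _ in range(4000)]  # assumed source: uniform on [0, 1]
levels = lloyd_1d(train, [0.1, 0.9])
print(levels, mse(train, levels))
```

With a longer training sequence the designed levels and the empirical distortion would be expected to move closer to the values for the true distribution.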

For the rest of the paper, we will restrict our attention to distributions
having a density, which we denote by f. Also, except for a portion
of the last example, we will specialize to r-th power distortions, which
have been widely used and studied. In this case, a natural procedure would
be to use some density estimator $\hat{f}_n$ based on a training sample $X_1,X_2,\ldots$ in
the quantization algorithm. Then (3) becomes

$$\lim_{n\to\infty} \int \|x-y\|^r \hat{f}_n(x)\,dx = \int \|x-y\|^r f(x)\,dx \qquad (4)$$

for almost all training sequences; for each $y \in \mathbb{R}^k$. This equation says that
if $\hat{f}_n$ and f are to behave in almost the same manner using r-th power
quantizers, then they should have nearly the same r-th moments about
arbitrary points y. Viewed differently, the quantity

$$\int \|x-y\|^r \hat{f}_n(x)\,dx$$

is an estimator of the r-th moment about y of the density f. Theorem 1
requires this estimator to be strongly consistent, i.e., to converge almost
surely to the true moment of f. In addition, of course, we want the
distribution associated with $\hat{f}_n$ to converge weakly to the distribution of
f. This is satisfied if we have

$$\lim_{n\to\infty} \hat{f}_n(x) = f(x) \qquad (5)$$

for almost all x; for almost all training sequences. The discussion thus
far can be summarized by saying that an estimator $\hat{f}_n$ based on samples can
be used in a quantizer design algorithm in place of f if it satisfies the
strong consistency conditions (4) and (5). The examples below illustrate
this point.

Normal densities. Consider that the density f to be quantized is
univariate normal with unknown mean $\mu$ and variance $\sigma^2 > 0$. Let $\mathcal{N}(x)$ denote
the standard normal density. Then we may write $f(x) = \mathcal{N}((x-\mu)/\sigma)/\sigma$. Let
$X_1,X_2,\ldots$ be independent, identically distributed samples from f. Strongly
consistent estimates of the unknown parameters $\mu$ and $\sigma^2$ are, respectively,
the sample mean and variance, $\hat{\mu}_n$ and $\hat{\sigma}_n^2$. Let $\hat{f}_n$ be a normal density with
these parameters. This has the strong consistency property (5). Next we
will show (4). By a change of variable

$$\int |x-y|^r \hat{f}_n(x)\,dx = \int |\hat{\sigma}_n x + \hat{\mu}_n - y|^r \mathcal{N}(x)\,dx.$$

Upon applying the $c_r$-inequality [11, p.157] we have

$$|\hat{\sigma}_n x + \hat{\mu}_n - y|^r \le c_r|\hat{\sigma}_n|^r|x|^r + c_r|\hat{\mu}_n - y|^r.$$

By the strong consistency of the estimators,

$$\lim_{n\to\infty} \int \left(c_r|\hat{\sigma}_n|^r|x|^r + c_r|\hat{\mu}_n - y|^r\right)\mathcal{N}(x)\,dx = \int \left(c_r\sigma^r|x|^r + c_r|\mu - y|^r\right)\mathcal{N}(x)\,dx \quad \text{wp1}.$$

Thus (4) follows upon invoking a generalized Dominated Convergence Theorem
[12, p.89].
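For r = 2 the plug-in moment in (4) has a closed form, which makes the convergence easy to observe numerically: the second moment about y of a normal density with mean m and variance v is v + (m - y)^2. The Python sketch below evaluates this with the sample mean and variance and compares it with the true value. The parameter values, sample sizes and function names are assumptions chosen for the illustration, and the population form of the sample variance (divisor n) is used, which is also strongly consistent.

```python
import random
import statistics

def plug_in_second_moment(samples, y):
    """Second moment about y of the fitted normal density f_hat_n.

    Uses the identity  integral (x - y)^2 N(m, v) dx = v + (m - y)^2
    with m, v replaced by the sample mean and (divisor-n) sample variance.
    """
    mu_hat = statistics.fmean(samples)
    var_hat = statistics.pvariance(samples, mu_hat)
    return var_hat + (mu_hat - y) ** 2

def true_second_moment(mu, sigma, y):
    """Second moment about y of the true normal density."""
    return sigma ** 2 + (mu - y) ** 2

rng = random.Random(0)          # fixed seed for repeatability
mu, sigma, y = 1.0, 2.0, 0.5    # assumed true parameters and reference point
for n in (100, 1000, 20000):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    print(n, plug_in_second_moment(xs, y), true_second_moment(mu, sigma, y))
```

The gap between the two printed columns should tend to shrink as n grows, which is the strong consistency required in (4) for this special case.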

The family of normal densities in this example can be replaced with
any class of almost everywhere continuous densities parametrized by location
and/or scale parameters. We may contemplate other instances in which the
unknown parameter is neither a location nor scale parameter, but where the
above analysis can be useful. For instance, the exponent p in the generalized
Gaussian density $f_p(x) = K \exp\{-\gamma|x|^p\}$ can be varied to fit many
histograms.

In some situations it might be more appropriate to use a nonparametric
density estimator. A popular type of estimator is the kernel density estimator
introduced by Parzen [13] for univariate densities and generalized to
multivariate densities by later authors:

$$\hat{f}_n(x) = n^{-1} \sum_{i=1}^{n} h_n^{-k} K((x - X_i)/h_n).$$

The kernel K(x) is a probability density function on $\mathbb{R}^k$ and $\{h_n\}$ is a sequence
of positive numbers decreasing to zero. Nadaraya [14] shows for the univariate case
that if

    f(x) is a uniformly continuous density,
    K(x) has bounded variation, and                                    (6)
    $\sum_{n=1}^{\infty} \exp(-\gamma n h_n^2)$ converges for every $\gamma > 0$,

then $\hat{f}_n(x) \to f(x)$ uniformly with probability one. The extensions of the
result to several dimensions use slightly different sets of assumptions.
We give the result of Moore and Yackel [15] as a typical example. If

    f(x) is a uniformly continuous density on $\mathbb{R}^k$,
    K(x) is a bounded density on $\mathbb{R}^k$,                                  (7)
    K(x) has bounded variation, and
    $n h_n^{2k} / \log n \to \infty$ as $n \to \infty$,

then $\hat{f}_n(x) \to f(x)$ uniformly with probability one.

Assuming that (6) or (7) holds, the distribution of $\hat{f}_n$ converges
setwise to that of f. In addition, K and $\{h_n\}$ should be chosen appropriately
so that (4) is satisfied. Two examples are given below. The first
is for scalar quantization and the second for vector quantization.
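Before turning to the two examples, the kernel estimator itself can be sketched in a few lines of Python for the univariate case (k = 1). The Gaussian kernel, the bandwidth sequence h_n = n^(-1/5), the standard normal source and all function names are assumptions made for this illustration. The Gaussian kernel is a bounded density of bounded variation, and with this bandwidth n h_n^2 / log n = n^(3/5) / log n grows without bound, so the bandwidth condition in (7) is met for k = 1.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density_estimate(x, samples, h):
    """f_hat_n(x) = n^{-1} * sum_i h^{-1} K((x - X_i) / h), for k = 1."""
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (len(samples) * h)

def bandwidth(n):
    """Assumed bandwidth sequence h_n = n^(-1/5), decreasing to zero."""
    return n ** (-0.2)

rng = random.Random(0)                                  # fixed seed
samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]    # assumed source: N(0, 1)
h = bandwidth(len(samples))
estimate_at_zero = kernel_density_estimate(0.0, samples, h)
true_at_zero = 1.0 / math.sqrt(2.0 * math.pi)
print(estimate_at_zero, true_at_zero)
```

The printed pair compares the estimate with the true density at a single point; the uniform convergence results cited above concern the supremum of this error over all x.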

Compact support. Assume that the unknown density f has compact
support (which we may take without loss of generality to be contained in
[-1,1]) and is uniformly continuous on the real line. We wish to quantize
this optimally using a cost function $C_0$ which is nonnegative, nondecreasing
and lower semicontinuous, so we also assume that $\int C_0(|x-y|)\,K(x)\,dx < \infty$ for
all y. For convenience we take the kernel to be symmetric and unimodal, so
it decreases away from the origin. Consider

$$\int_{|x|>a} C_0(|x-y|)\,\hat{f}_n(x)\,dx \le 2\int_{x>a} C_0(|x-y|)\,\frac{1}{h_n} K\!\left(\frac{x-1}{h_n}\right)dx \quad \text{wp1}.$$

After a change of variable and simplification, we have

$$\int_{|x|>a} C_0(|x-y|)\,\hat{f}_n(x)\,dx \le 2\int_{x>a/h_1} C_0(|h_1 x + 1 - y|)\,K(x)\,dx \quad \text{wp1}.$$

It follows that $C_0(|x-y|)$ is uniformly integrable with respect to $\{\hat{f}_n(x)\}$,
wp1. Therefore the kernel estimator is a viable basis for designing a
scalar quantizer for a density with compact support.

Noncompact support. We can extend the previous analysis to multivariate
densities with unbounded support if we restrict attention to r-th power
quantizers. A bounded continuous density f(x) whose tails go to 0 as
$\|x\| \to \infty$ is uniformly continuous, so none of the common densities are
excluded by the assumption (7). Let the cost function be $C_0(t) = t^r$ where
temporarily r = 2p is an even integer. We will also make the natural
assumptions that $\|x\|$ has finite (2p)-th moments with respect to both
densities f and K. Our immediate goal is to show (4) for r = 2p. By the
$c_r$-inequality [11, p.157] we have

$$\|x-y\|^{2p}\hat{f}_n(x) \le c_p \sum_{j=1}^{k} \frac{1}{n}\sum_{i=1}^{n} (x^{(j)} - y^{(j)})^{2p}\, h_n^{-k} K\!\left(\frac{x - X_i}{h_n}\right) \qquad (8)$$

where the superscript denotes the j-th component of the k-dimensional
vector. Note that the right-hand side converges pointwise as $n \to \infty$ to

$$c_p \sum_{j=1}^{k} (x^{(j)} - y^{(j)})^{2p} f(x).$$

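If the right-hand side of (8) is integrated, we get k terms, a typical
one of which looks like the following if we ignore the constant $c_p$:

$$m_{j,n} = n^{-1} \sum_{i=1}^{n} \int_{\mathbb{R}^k} \left[h_n x^{(j)} + X_i^{(j)} - y^{(j)}\right]^{2p} K(x)\,dx.$$

We can use the binomial theorem to expand the last integrand. After
rearranging the sums we get

$$m_{j,n} = n^{-1}\sum_{i=1}^{n} (X_i^{(j)} - y^{(j)})^{2p} + \sum_{\ell=0}^{2p-1} \binom{2p}{\ell} h_n^{2p-\ell} \left[ n^{-1}\sum_{i=1}^{n} (X_i^{(j)} - y^{(j)})^{\ell} \int_{\mathbb{R}^k} [x^{(j)}]^{2p-\ell} K(x)\,dx \right].$$

The terms in the summation from $\ell = 0$ to $\ell = 2p-1$ are all multiplied by a
positive power of $h_n$, which decreases to 0, so that in the limit, the second group
of terms above is zero. Using the Strong Law of Large Numbers on the
first term gives

$$\lim_{n\to\infty} m_{j,n} = \int (x^{(j)} - y^{(j)})^{2p} f(x)\,dx \quad \text{wp1}.$$

This is true for all j in (8). Therefore

$$\lim_{n\to\infty} \int c_p \sum_{j=1}^{k} (x^{(j)} - y^{(j)})^{2p} \hat{f}_n(x)\,dx = \int c_p \sum_{j=1}^{k} (x^{(j)} - y^{(j)})^{2p} f(x)\,dx \quad \text{wp1}.$$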

Since $\|x-y\|^{2p}\hat{f}_n(x)$ converges to $\|x-y\|^{2p}f(x)$ wp1, we may use a generalization
of the Dominated Convergence Theorem [12, p.89] to conclude that (4)
holds for r = 2p. Therefore we have uniform integrability for the cost function
with r an even integer. In general, for any positive r, we can let 2p
be the smallest even integer for which r < 2p. Then the above implies that
[16, p.252]

$$\lim_{n\to\infty} \int \|x-y\|^r \hat{f}_n(x)\,dx = \int \|x-y\|^r f(x)\,dx \qquad (9)$$

Thus the kernel estimator $\hat{f}_n$ can be used to design a quantizer for f.
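The binomial expansion of the moment of the kernel estimate can be checked numerically in the simplest case k = 1, p = 1. For a symmetric kernel with unit variance, the cross term vanishes and the second moment of the estimate about y is the sample moment plus h_n^2, so the excess over the sample moment disappears as h_n decreases to 0. The Python sketch below compares this closed form with a direct midpoint-rule integration. The Gaussian kernel, the bandwidth, the sample and the function names are assumptions made for the illustration.

```python
import math
import random

def kde_second_moment_numeric(samples, h, y, lo=-12.0, hi=12.0, steps=2400):
    """Midpoint-rule approximation of  integral (x - y)^2 f_hat_n(x) dx
    for a Gaussian-kernel estimate with bandwidth h (k = 1)."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        f_hat = norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
        total += (x - y) ** 2 * f_hat * dx
    return total

def kde_second_moment_closed_form(samples, h, y):
    """Binomial expansion with p = 1: sample moment + h^2 * (kernel variance).

    The cross term is zero because the Gaussian kernel has mean 0,
    and its variance is 1.
    """
    return sum((xi - y) ** 2 for xi in samples) / len(samples) + h ** 2

rng = random.Random(2)                              # fixed seed
xs = [rng.gauss(0.0, 1.0) for _ in range(50)]       # assumed source: N(0, 1)
h = len(xs) ** (-0.2)                               # assumed bandwidth n^(-1/5)
numeric = kde_second_moment_numeric(xs, h, 0.5)
closed = kde_second_moment_closed_form(xs, h, 0.5)
print(numeric, closed)
```

The two values should agree up to the small error of the numerical integration.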

Another extension of the analysis is possible. Consider a cost function
$C_0(t)$ which grows polynomially fast, i.e., there exist $\alpha$, $\beta$ and r so
that

$$0 \le C_0(t) \le \alpha t^r + \beta. \qquad (10)$$

Since

$$\lim_{n\to\infty} C_0(\|x-y\|)\hat{f}_n(x) = C_0(\|x-y\|)f(x) \quad \text{wp1},$$

(9) and (10) imply, through the generalization of the Dominated Convergence
Theorem [12, p.89], that

$$\lim_{n\to\infty} \int C_0(\|x-y\|)\,\hat{f}_n(x)\,dx = \int C_0(\|x-y\|)\,f(x)\,dx \quad \text{wp1}.$$

So if $C_0$ is nondecreasing and lower semicontinuous, then Theorem 1 also
applies, and $\hat{f}_n$ can be used to design a quantizer with the cost function
$C_0$. Eq. (10) is equivalent to having a $\lambda > 0$ so that

$$C_0(2t) \le \lambda C_0(t) \text{ for all } t. \qquad (11)$$

This is a useful form for a cost function because (11) and $\int C_0(\|x\|)\,dF < \infty$
imply that $\int C_0(\|x-y\|)\,dF < \infty$ for all vectors y.

As  an  application  of  this  example,  consider  Max's  algorithm  [4]  (also 
called  Lloyd's  Method  II  [1]  or  the  Lloyd-Max  algorithm).  Some  aspects  of 
this  algorithm  were  mentioned  earlier,  and  we  noted  that  as  it  has  been 
described  in  the  literature,  the  algorithm  requires  the  existence  of  a 
known  density  function.  It  does  not  seem  to  be  readily  adapted  for 
quantizing  an  empirical  distribution  function.  However,  with  the  approach 
outlined  in  this  paper,  it  is  possible  to  use  density  estimates  as  the 
input  to  Max's  algorithm  and  get  strongly  consistent  estimates  of  the 
optimal  output  levels  and  breakpoints.  Thus  the  algorithm  can  be 
(indirectly)  driven  by  a  training  sequence  of  independent  observations. 
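A minimal Python sketch of driving a density-based design by a training sequence follows. It should be read with care: it does not implement Max's sequential search. Instead it uses the simpler fixed-point iteration on the same necessary conditions for squared error (breakpoints at midpoints of adjacent levels, levels at centroids of their cells), with the integrals replaced by sums over a uniform grid. The Gaussian kernel, the bandwidth n^(-1/5), the integration interval, the standard normal source and all function names are assumptions made for the illustration.

```python
import math
import random

def gaussian_kde(samples, h):
    """Return x -> kernel density estimate (Gaussian kernel, bandwidth h)."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    def f_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
    return f_hat

def design_from_density(density, lo, hi, levels, grid_size=400, iters=100):
    """Alternate midpoint breakpoints and centroid levels for a density
    tabulated on a uniform grid over [lo, hi] (squared-error cost)."""
    step = (hi - lo) / grid_size
    xs = [lo + (i + 0.5) * step for i in range(grid_size)]
    ws = [density(x) for x in xs]            # density is evaluated only once
    levels = sorted(levels)
    for _ in range(iters):
        breaks = [(a + b) / 2.0 for a, b in zip(levels, levels[1:])]
        mass = [0.0] * len(levels)
        moment = [0.0] * len(levels)
        for x, w in zip(xs, ws):
            cell = sum(x > b for b in breaks)  # index of the cell containing x
            mass[cell] += w
            moment[cell] += w * x
        # A cell with no mass keeps its previous level.
        levels = [m1 / m0 if m0 > 0 else y
                  for m0, m1, y in zip(mass, moment, levels)]
    breaks = [(a + b) / 2.0 for a, b in zip(levels, levels[1:])]
    return levels, breaks

rng = random.Random(1)                                   # fixed seed
train = [rng.gauss(0.0, 1.0) for _ in range(1000)]       # assumed source: N(0, 1)
f_hat = gaussian_kde(train, len(train) ** (-0.2))
levels, breaks = design_from_density(f_hat, -5.0, 5.0, [-1.0, 1.0])
print(levels, breaks)
```

For a standard normal density the optimal 2-level quantizer under squared error has levels at plus and minus sqrt(2/pi), about 0.798, and a breakpoint at 0, so the printed values can be compared with these.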

APPENDIX 

In this appendix we show that (3) implies (2). Eq. (3) says that for
a given $y \in \mathbb{R}^k$, there is a set A(y) of sample sequences $(X_1,X_2,\ldots)$ having
probability one for which the integrals converge. Since $\mathbb{R}^k$ is separable,
it is possible to find a dense and countable set $\{y_j\}$ for which this
convergence holds with probability one. In fact, for any countable subset
of $\mathbb{R}^k$,

$$A = \bigcap_{j=1}^{\infty} A(y_j)$$

is a set of probability one. Now, for a given y in (2), we may find a
finite subset $\Lambda$ of the $y_j$'s so that

$$C_0(\|x-y\|) \le \sum_{y_j \in \Lambda} C_0(\|x-y_j\|).$$

This involves constructing a cube around y and picking the $y_j$ outside the
cube so that $\|x-y\| \le \|x-y_j\|$ for at least one $y_j$. Then

$$\sup_n \int_{\|x\|>a} C_0(\|x-y\|)\,d\hat{F}_n(x) \le \sum_{y_j \in \Lambda} \sup_n \int_{\|x\|>a} C_0(\|x-y_j\|)\,d\hat{F}_n(x).$$

For each $y_j$ and each sample sequence in A, the right-hand side can be made
arbitrarily small by appropriately choosing a. Therefore $C_0(\|x-y\|)$ is
uniformly integrable for every y, with respect to every $\{\hat{F}_n\}$ arising from
the set A. This is equivalent to (2). Therefore, (3) implies (2).